1. INTRODUCTION

Part-of-speech (POS) tagging refers to assigning an appropriate part-of-speech tag to each word of a
sentence [1]. In other words, it is the procedure of identifying each noun, verb, adjective or other part of
speech [1]. POS taggers have been gaining widespread attention in the field of linguistics and have been
applied in lexical feature extraction for word clustering [2], Twitter [3], and medical blogs [4]. Compared to
other languages [5] such as English [6-7], the development of Malay language corpora in Malaysia is still
lagging behind. To the best of our knowledge, there is not yet a Malay language corpus that compiles a
specific and detailed list of criminological terms in Malay.

Linguistics literature [8] has highlighted how the Malay language has many loanwords from other
languages. Since then, large-scale linguistic works have been established. Tasks such as word tagging and
tokenizing are done in many different languages, including Arabic [9], Hebrew [10], German [11], Urdu [12], 
Burmese [13], Russian [14], Chinese [15] and Swedish [16]. In other words, the process of text segmentation 
involved in these studies has been used in many different languages for text analysis [17]. 

English is arguably one of the most widely used and best-established languages in this respect.
Although several analyses of Malay corpora have been conducted, the development of English language
resources remains the example to follow. Online newspapers (e.g. Utusan Online or Berita Harian) only have
a general tag for searching all crime news online, which is “jenayah” in Malay. The word “crime” is too
abstract and broad a term to be of much help to forensic linguistic users. In particular, professionals in
crime-related fields such as police, lawyers and forensic scientists may find it helpful to search for
materials related to crime with a list of such terms in Malay. With such a list of tags, more relevant
information would be available for crime-related fields and for academic or research purposes. While the
Malay language is a medium of instruction in education, the majority of online communication in Malaysia
remains in English [18]. Furthermore, the Malay language does not yet have a specific list of crime-related
terminologies developed for crime-related news or information search.

Thus, this study aims to look at the creation of crime-related words by identifying the most 
frequently used criminology terms from online news articles. The rest of this paper is organized as follows: 
Section 2 presents the literature review, and Section 3 describes the method of this study using human coder 
and sentiment tools’ setups. Section 4 contains an analysis of the survey results, followed by the evaluation 
of sentiment tools’ testing. Section 5 then concludes the paper. 


2. LITERATURE REVIEW 
2.1. Crime 

Crime has always been a big societal issue, regardless of whether it is a knife crime or a cybercrime. 
In 2018, the crime rate in Malaysia was still on the rise [19]. In the worldwide crime index, Malaysia is ranked
at number 15 (63.05%), while the United States is at number 35 (49.58%) and the United Kingdom at 
number 62 (41.20%). Malaysia’s crime index was rated at 70.88% in 2012, decreased to 67.50% in 2014 and 
rose again to 69.70% in 2015. 

By 2017, the crime index had decreased to 63.05%. These numbers are still considered high, and
Malaysia is still a country that is plagued by crime. A recent open-source statistical report on Malaysia
categorizes crimes into two main categories [19]: 1) acts of violence, and 2) property damage. As shown in
Table 1, these two categories can be separated into seven and six subcategories, respectively.

The subcategories in Table 1 show the various types and numbers of crimes that are being committed
in the country, which makes crime a significant topic for further analysis. However, the statistics only
consider two crime categories and do not take into account other categories of crime that exist in the
country.


Table 1. Malaysian Crime Categories 


Crime Categories                                   Amount (the year 2016)
Acts of Violence
  Murder                                              456
  Rape                                               1886
  Robbery: Accomplices with Firearms                   65
  Robbery: Accomplices without Firearms             10907
  Robbery: Firearms                                    18
  Robbery: Without Firearms                          3463
  Wounding                                           5531
Property Damage
  Theft                                             19894
  Car Theft                                         10607
  Motorcycle Theft                                  34754
  Heavy Vehicle Theft                                3050
  Snatch Theft                                       2963
  Breaking, Entering and Stealing / Burglary        18760
Total Crime Index                                  112354


2.2. How Does Literature Categorize Crime Terminologies? 

Much research has been done to categorize crime terminologies [20-22]. From the literature mentioned,
crime can be primarily categorized into the following seven categories: 1) property theft, 2) violent
crime, 3) controlled substance/drug, 4) terrorism, 5) abuse, 6) white-collar crime, and 7) forced labour. As
shown in Table 2, each of these broad categories of crime can then be broken down into different 
subcategories [20, 23-26]. 

In data classification, it is essential to group terms that share a common characteristic, meaning or
quality. With the classification in Table 2, the process of categorizing crime terminologies becomes clearer. 


Table 2. Categories of Crime

Major Categories (number of crimes)   Subcategories
Property Theft [count illegible]      Theft, Car Theft, Motorcycle Theft, Heavy Vehicle Theft, Snatch Theft,
Jenayah Hartabenda                    Breaking, Entering and Stealing / Burglary.

Violent Crimes (7*)                   Murder, Rape, Armed Robbery with Accomplices, Unarmed Robbery without
Jenayah Kekerasan                     Firearms, Armed Robbery, Unarmed Robbery, Wounding.

Controlled Substances/Drugs (7*)      Trafficking, Drug Possession, Controlled Substance Violation and Other
Bahan-Bahan Terkawal                  Crimes/Activity, Racketeering, Smuggling, Laundering Money from Controlled
                                      Substances, Tax Offenses.

Terrorism (8*)                        Cyber Terrorism, State Terrorism, State Sponsored Terrorism, Nationalist
Pengganasan                           Terrorism, Religious Terrorism, Left and Right Wing Terrorism, Anarchist
                                      Terrorism, Suicide Terrorism.

Abuse (7*)                            Child Abuse, Physical Abuse, Emotional Abuse, Sexual Abuse, Neglect,
Penderaan                             Bullying, Financial Exploitation.

White-Collar Crime (8*)               Antitrust, Securities Fraud, Mail Fraud, False Claims, Credit Fraud, Bribery,
Jenayah Kolar Putih                   Tax Fraud, Bank Embezzlement.

Forced Labour (8*)                    Forms of coercion, Prison Labour, Forced Overtime, Human Trafficking,
Buruh Paksa                           Trafficking or Smuggling, Slavery, Child Labour, Bonded Labour.

*The number of subcategories in the category. The name on the second line of each row (italic in the
original) is in the Malay language.



A common way of categorizing a keyword is through keyword extraction [27]. This process is done 
based on the available list of keywords to accommodate the categorization of other keywords into those 
categories. However, an issue that may arise is that while there are subcategories that represent the general 
category of crime terms, there is no evidence or method to show that some types of crime belong in a 
particular subcategory, especially in the Malay language [27]. The use of keyword-based categorization to 
classify text into a corresponding category requires approximately 30 keywords to represent each category. 

In this study, there are no keywords that are used to represent each category of crime. There is 
only a list of English words for crime and general terms without a source of references to their major 
categories [28]. Thus, making a list of words for crime is essential. As seen in Table 2, several major crime 
categories and their subcategories have been summarized and tabulated. This study aims to develop a list of 
crime-related Malay terminologies. However, it has also assisted us in producing a list of English 
terminologies. Until today, Malaysian police reports and documents are still written in the Malay language. 

In Malaysia, online news content on crime is generally tagged as ‘crime’ or ‘jenayah’.
No website was found to provide a list of tags that give further insight into the specific crime that the
content belongs to. To fill the gap, the main aim of the study is to create a list of crime-related Malay
terminologies.


3. MATERIALS AND METHODS 
3.1. Phase 1: Data Collection 

The first stage of this study was to collect news from online newspapers in the Malay language, 
particularly news and articles that related to crime. Initially, 200 news articles were compiled. Manually, 
all words from the articles were recorded in a database, which separated the words by dates. For each year 
between 2014 and 2017, at least 50 articles were manually recorded. The number of selected online articles 
from Utusan Online was 71 (www.utusan.com.my), 60 from Berita Harian (www.bharian.com.my) and 69 
from Harian Metro (www.hmetro.com.my). These websites generally feature newspaper articles for all 
categories and are written in the Malay language. The use of these newspaper articles makes it possible not 
only to obtain unique information of the way in which each newspaper reports or writes crime-related 
content, but also to consider the types of crime that have been, and are being, reported. 

A random sampling method was used to select the articles and to ensure that the data collected was
not biased [29]. The sampling was carried out using a random number generator whose maximum limit
was the number of articles available on the newspaper webpage.
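
The sampling step can be sketched as follows. This is only an illustration under stated assumptions: the
function name, the seed and the figure of 500 available articles are hypothetical, and the study does not
say which random number generator was used.

```python
import random


def sample_article_indices(num_available, num_wanted, seed=None):
    """Draw distinct 1-based article positions uniformly at random."""
    if num_wanted > num_available:
        raise ValueError("cannot sample more articles than are available")
    rng = random.Random(seed)
    # random.sample draws without replacement, so no article is picked twice.
    return sorted(rng.sample(range(1, num_available + 1), num_wanted))


# Hypothetical example: 500 articles listed on the page, 71 wanted
# (71 is the number of Utusan Online articles reported in the text).
picked = sample_article_indices(num_available=500, num_wanted=71, seed=42)
print(len(picked), picked[:5])
```

Fixing the seed makes the selection repeatable, which helps when the same sample has to be re-collected.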


3.2. Phase 2: Pre-processing of Data using Human Coder 

Due to the lack of Malay sentiment tools, four human coders were used to read each newspaper
article and verify the news content. Through the random sampling method, some collected articles were
found to be irrelevant. For example, under the list of crime articles in Berita Harian, news on ‘accidents’ had
been erroneously included. To overcome this issue, each news article was read through by the human coders
and, if unsuitable, was removed from the data used for the top 500 crime-related keyword search.

The second issue that had to be countered during the data pre-processing stage was the presence of
duplicate news from different newspapers. Each article was regarded as a distinct piece of news, as
the authors of the news articles might have used different terms to write a similar story. This particular
issue still has to be studied. Figure 1 illustrates the flow of data pre-processing and the following phases
(feature selection, evaluation, etc.).


Figure 1. Flow of Data Pre-processing, Feature Selection to Evaluation


3.3. Phase 3: Processing the Data 

From the literature review, a total of 52 subcategories of crime were identified and summarized in
Table 2. The reduction process was done because there was an overlapping of attributes (words) that
appeared in different categories. Furthermore, to ensure that the words in the final list have no similarity in 
meaning, a crime vocabulary in English was used to distinguish the semantic meaning of the words. This step 
was to benchmark all semantic meaning of each English word to a Malay meaning using four human coders. 

Therefore, a list of wide-ranging crime vocabulary in English was obtained online from Cambridge 
Dictionary and Oxford Dictionary. The dictionaries were also used to translate English words into Malay, as 
sometimes one dictionary alone would not be able to provide the Malay equivalent of a word semantically. 
The human coder, therefore, had to determine the outcome. If no Malay translation of a word could be found 
in either dictionary, then the English-Malay Google Translate tool would be used to attain a rough 
translation. 


3.4. Phase 4: Processing the Training Set of Data 

This step was performed to create a list of categorized newspaper articles by comparing the list of 
words that appeared in the news with the list of Malay-translated words gathered from the previous step 
(Phase 3). When the text of a news article contained more crime words from one category list
(e.g. Murder) than from any other, the article was categorized under that particular category.
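
A minimal sketch of this frequency-based categorization is given below. The word lists are abbreviated and
hypothetical (they reuse a few words that appear later in the paper), the tokenizer is an assumption, and
multi-word terms such as ‘bunuh diri’ are not handled.

```python
import re
from collections import Counter

# Hypothetical, abbreviated word lists; the study's full list holds 521 Malay terms.
CATEGORY_WORDS = {
    "Jenayah Kekerasan": {"mati", "mayat", "cedera"},
    "Bahan-bahan Terkawal": {"dadah", "pil", "ekstasi"},
}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens (hyphenated words stay whole)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def categorize(text, category_words=CATEGORY_WORDS):
    """Return the category whose words occur most often, or None if none occur."""
    tokens = tokenize(text)
    counts = Counter()
    for category, words in category_words.items():
        counts[category] = sum(1 for token in tokens if token in words)
    # On a tie, the category listed first in category_words is returned.
    best, best_count = counts.most_common(1)[0]
    return best if best_count > 0 else None


print(categorize("polis merampas dadah dan pil ekstasi"))
```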


3.5. Phase 5: Using the WEKA 

Using WEKA [30], the dataset, which was originally a collection of text in String format, was
converted into one attribute per word using the StringToWordVector function. In this step, unnecessary
attributes (for example, ‘ada’, ‘akan’, etc., in Malay) which may negatively affect the data due to an
overlapping of words were filtered and removed, so that the keywords that could best help the classification
prediction were obtained. Table 3 shows an example of a list of words (attributes) that were selected from the
CorrelationAttributeEval feature selection.
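
The conversion of text into word attributes with stop-word removal can be illustrated in plain Python. This
is a sketch of the idea rather than WEKA's StringToWordVector itself: the stop-word set contains only the
two examples given above, the tokenizer is an assumption, and the sketch records word counts although the
study does not state whether counts or word presence were used.

```python
import re
from collections import Counter

# Only the two stop words named in the text; a real Malay stop-word list is longer.
STOP_WORDS = {"ada", "akan"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def to_word_vectors(documents, stop_words=STOP_WORDS):
    """Return (vocabulary, vectors), one count per vocabulary word per document."""
    token_lists = [
        [token for token in tokenize(doc) if token not in stop_words]
        for doc in documents
    ]
    vocabulary = sorted({token for tokens in token_lists for token in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocab, vecs = to_word_vectors(["polis akan menahan lelaki", "lelaki ada dadah dadah"])
print(vocab)
print(vecs)
```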


Table 3. Crime Categories and Number of Related Words 


Category                 Number of Words
Violent Crimes           129
Property Theft           72
Abuse                    [illegible]
Forced Labour            47
White-collar Crime       74
Controlled Substances    [illegible]
Terrorism                32


3.6. Phase 6: Feature Selection

Phase 6 involved feature selection, also known as attribute selection, to remove noise features.
In this study, the GainRatioAttributeEval, InfoGainAttributeEval and CorrelationAttributeEval feature
selection algorithms were used, each with the Ranker search method. Using three different feature
selection algorithms made it possible to check the consistency of the study’s evaluation. The best 500
extracted keywords that could best help the classification prediction were obtained. Table 3 shows an
example of a list of words (attributes) that were selected from the CorrelationAttributeEval feature selection.
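
To illustrate how a ranking-based attribute evaluator works, the sketch below scores each word by its
information gain with respect to the class label and keeps the top-ranked words. It is a simplified
stand-in for the idea behind InfoGainAttributeEval with a ranked search, not WEKA's implementation: it
treats each word as a present/absent attribute, and the toy documents and labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(presence, labels):
    """Entropy reduction from knowing whether a word is present in each document."""
    gain = entropy(labels)
    for value in (True, False):
        subset = [label for p, label in zip(presence, labels) if p == value]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain


def rank_words(doc_tokens, labels, top_k):
    """Return the top_k (word, gain) pairs, highest gain first, ties alphabetical."""
    vocabulary = sorted({token for tokens in doc_tokens for token in tokens})
    scored = [
        (word, information_gain([word in tokens for tokens in doc_tokens], labels))
        for word in vocabulary
    ]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical toy data: each document is a set of tokens.
docs = [{"dadah", "polis"}, {"dadah", "lelaki"}, {"mayat", "polis"}, {"mayat", "lelaki"}]
labels = ["drug", "drug", "violent", "violent"]
print(rank_words(docs, labels, top_k=2))
```

In this toy data ‘dadah’ and ‘mayat’ separate the two classes perfectly, while ‘polis’ and ‘lelaki’ carry no
information about the class, which is the kind of noise feature the selection step is meant to drop.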


3.7. Phase 7: Model Evaluation/Validation 

In Phase 7, the classified set of terms was evaluated. The Naive Bayes classifier was used to
categorize the dataset as it is a simple probabilistic classifier which is effective in analyzing text in many
domains. This particular classifier was selected because it had been successfully applied to text analysis in
a past study [31]. Moreover, since there were seven different categories of crime to be classified, Naive
Bayes was chosen as it is known to handle multi-class prediction, which could generate better output for
text analysis. The output model was evaluated through correctly classified instances, incorrectly classified
instances, recall, precision, F-measure, and ROC Area.
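
The principle behind a Naive Bayes text classifier can be sketched in a few lines. The version below is the
multinomial variant with add-one (Laplace) smoothing, which is an assumption made for illustration; the
study used WEKA's Naive Bayes, whose internal modelling of attributes may differ. The class name and the
training documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # words never seen in training are ignored
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_docs = [["dadah", "pil"], ["dadah", "ekstasi"], ["mati", "mayat"], ["cedera", "mayat"]]
train_labels = ["Bahan-bahan Terkawal", "Bahan-bahan Terkawal",
                "Jenayah Kekerasan", "Jenayah Kekerasan"]
model = NaiveBayesText().fit(train_docs, train_labels)
print(model.predict(["dadah", "polis"]))
```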


4. RESULTS 
4.1. Part 1: List of crime words according to the category 

The list of words was gathered through the process of searching for words related to each crime
category. From the seven categories of crime, a total of 724 crime terminologies were collected. Following
the conversion of words into Malay, a total of 521 crime words were left; an example can be seen in
Figure 2. Due to the nature of language, some Malay-translated words appeared to be similar. If similar
words appeared within the same category, the duplicates were eliminated, thus reducing redundancy.
Table 3 shows the number of words that represent each crime category. The words for each category were
then utilized to categorize the training set.


508  Pengganasan  7  militan
509  Pengganasan  7  aktivis
510  Pengganasan  7  pengeboman
511  Pengganasan  7  osama
512  Pengganasan  7  laden
513  Pengganasan  7  gerila
514  Pengganasan  7  bunuh diri
515  Pengganasan  7  propaganda
516  Pengganasan  7  kempen
517  Pengganasan  7  al-qaeda
518  Pengganasan  7  bangkit
519  Pengganasan  7  [illegible]
520  Pengganasan  7  tentera
521  Pengganasan  7  taliban

Figure 2. Malay crime category and list of words 


4.2. Part 2: News Categorization for Training Set 

Categorization process was done for news text. The frequency of the words and the category to 
which they belonged determined the category of the text as a whole. Thus, the training set containing the text 
and its corresponding crime category was developed. 


4.3. Part 3: Data Pre-processing 

In this process, pre-processing was applied to the original dataset. By applying an
unsupervised filter, StringToWordVector in WEKA, each word in the text was converted
into its own attribute. This led to an increase in the total number of attributes. The stop-word list was
then applied to filter out attributes that matched a stop word. This pre-processing phase helped obtain
the attributes in the training and test datasets.


4.4. Part 4: Feature Selection

A list of the 500 most relevant attributes was finalized. Attributes with a low correlation were
dropped from the list so that they would no longer affect the output, which improved the classifier’s
prediction. CorrelationAttributeEval was applied as the feature selection method, together with the Ranker
search method.


4.5. Part 5: Classification 

The results of the Naive Bayes classifier under four feature selection settings (no feature selection
and the three algorithms) are shown in Table 4. This evaluation displays the accuracy of the model based
on the datasets that were input into the WEKA Machine Learning tool.


Table 4. Results of Classifier Accuracy 


Classifier   Feature Selection          Correctly Classified  Incorrectly Classified  Kappa
                                        Instances (%)         Instances (%)           Statistic
Naive Bayes  None                       82.50                 17.50                   0.7882
Naive Bayes  GainRatioAttributeEval     78.75                 21.25                   0.7425
Naive Bayes  InfoGainAttributeEval      78.75                 21.25                   0.7425
Naive Bayes  CorrelationAttributeEval   83.75                 16.25                   0.8040
Average                                 80.94                 19.06                   0.7693


From Table 4, it can be seen that the average percentage of correctly classified instances over
the four results is 80.94%. This not only shows the classification’s high accuracy but also signifies that, of
the 80 instances in the test dataset, the model classified 80.94% correctly on average. The kappa statistic
measures agreement between observers, with perfect agreement equal to a kappa of 1 [32]. The average
kappa of 0.7693 suggests that the classification did not leave much room for “random
guessing”. To obtain a more comprehensive analysis of the results, the detailed WEKA
outputs were studied.
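
As a worked illustration, the sketch below recomputes the averages reported in Table 4 and shows how a
kappa statistic is derived from a confusion matrix. The 2x2 confusion matrix is hypothetical, since the
study's confusion matrices are not reported; only the four accuracy and kappa values are taken from Table 4.

```python
from statistics import mean


def cohen_kappa(confusion):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(size)) / total
    # Chance agreement: product of the row and column marginals for each class.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1 - expected)


# Values from Table 4.
accuracies = [82.50, 78.75, 78.75, 83.75]
kappas = [0.7882, 0.7425, 0.7425, 0.8040]
average_accuracy = mean(accuracies)
average_kappa = mean(kappas)
print(round(average_accuracy, 2), round(average_kappa, 4))

# Hypothetical two-class confusion matrix, for illustration only.
example_kappa = cohen_kappa([[40, 10], [5, 45]])
print(round(example_kappa, 3))
```

A kappa near 0 would mean the agreement is no better than chance, while values closer to 1 indicate that
most of the agreement is not explained by chance.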

Table 5 shows the accuracy of the analysis for each class from the CorrelationAttributeEval
feature selection output. Based on the weighted averages of precision = 0.882, recall = 0.838 and
F-measure = 0.839, the results suggest that the classification was reliable and accurate for most classes.
The ROC [33] area was also high (ROC Area = 0.980), reflecting high accuracy in the test. Accuracy is
measured by the area under the ROC curve, whereby the closer the curve is to the Y-axis, the better the
result will be.
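
The per-class measures can be related to each other with a short calculation. In the sketch below, the
counts (3 true positives, 1 false positive, 4 false negatives) are an assumption chosen to be consistent
with the Penderaan row of Table 5; the underlying confusion matrix is not reported in the study.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Assumed counts that reproduce the Penderaan row (precision 0.750, recall 0.429).
precision, recall, f_measure = precision_recall_f(tp=3, fp=1, fn=4)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))
```

The low recall of this class shows that several abuse articles were assigned to other categories, even
though most articles that the model did label as abuse were labelled correctly.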

Figure 3 features the top 10 words from the seven crime categories. The classifier with the
CorrelationAttributeEval feature selection, which had the highest accuracy, is detailed in Table 5. The
attributes from the classifier were selected from the output of the feature selection process, and the words
(attributes) that matched the list of crime words were selected as the top 10 words of each crime category.
Figure 3 records the results, where each category has its own set of top 10 words followed by the rank of
each word, which affects the text classification.

While there are words that identify each category, there is the issue of overlapping words in more
than one category. For instance, in the ‘Jenayah Hartabenda’ and ‘Jenayah Kolar Putih’ categories, the word
‘curi’ is evident in both. The classifier may still manage to classify the text into its corresponding category
due to other related words within a particular category.


Table 5. A detailed analysis based on the Naive Bayes Classifier with CorrelationAttributeEval feature
selection

Class                 TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area
Bahan-bahan Terkawal  0.778    0.000    1.000      0.778   0.875      0.855  0.972     0.955
Buruh Paksa           0.833    0.000    1.000      0.833   0.909      0.907  0.928     0.860
Jenayah Hartabenda    0.778    0.000    1.000      0.778   0.875      0.855  0.989     0.970
Jenayah Kekerasan     1.000    0.154    0.600      1.000   0.750      0.713  0.985     0.936
Jenayah Kolar Putih   1.000    0.029    0.833      1.000   0.909      0.900  0.990     0.906
Penderaan             0.429    0.014    0.750      0.429   0.545      0.538  0.975     0.800
Pengganasan           1.000    0.000    1.000      1.000   1.000      1.000  1.000     1.000
Weighted Ave.         0.838    0.034    0.882      0.838   0.839      0.821  0.980     0.931


Figure 3. Top 10 words from each crime category


Table 6 presents each category of crime and its representative text. The frequency of the words in
each text contributes to the words that describe the category. At least one of the top 10 words used in each
category is present within the text. For example, the words ‘mati’, ‘mayat’ and ‘cedera’ are among the top 10
words, which describe the prevalence of violent crimes in the representative text.


Table 6. The Category, Representative Sentences and Words Describing the Category

Category: Jenayah Kekerasan
Representative sentence: seorang lelaki warga indonesia mati selepas terbabit dalam pergaduhan dengan rakan
senegaranya di kediaman mereka di kampung buluh penyumpit, mukim kuah di sini, hari ini. ketua bahagian
siasatan jenayah langkawi, asisten superintendan bee anak amba, berkata mayat lelaki berusia 38 tahun yang
belum dikenali itu ditemui berlumuran darah di atas sofa dalam rumah terbabit pada jam 6. 15 pagi. siasatan
awal mendapati mereka bergaduh sebelum maut manakala rakannya cedera.
Words describing category (legible entries): bee, cedera, lelaki, mayat, rakannya, rumah, siasatan, terbabit

Category: Jenayah Hartabenda
Representative sentence: empat lelaki yang cuba merompak kedai emas di jalan besar sasaran, kuala selangor,
pagi semalam, melarikan diri dengan tangan kosong selepas gagal memecahkan cermin pameran barang kemas.
ketua polis daerah kuala selangor, superintendan ruslan abdullah berkata, kejadian berlaku pada 11.35 pagi
dan tiada pelanggan ketika itu. menurutnya :
Words describing category (legible entries): cermin, cuba, emas, empat, kejadian, kemas, kuala, lelaki,
melarikan, pagi, pameran, pekerja, selangor

Category: Jenayah Kolar Putih
Representative sentence: dua konstabel polis ditahan suruhanjaya pencegahan rasuah malaysia (sprm) petang
tadi selepas disyaki meminta rasuah daripada ceti haram atau along di sungai petani. sumber berkata,
kedua - dua anggota berusia 34 dan 37 tahun itu ditahan sprm cawangan sungai petani pada 2 petang tadi.
anggota polis terbabit ditangkap kerana terbabit dalam permintaan wang rasuah berjumlah rm 10,000 daripada
pengadu yang menjalankan kegiatan peminjaman wang haram.
Words describing category (legible entries): anggota, pengadu, petani, rasuah, sprm, sungai, terbabit, wang

Category: Penderaan
Representative sentence: seorang wanita hong kong disabit kesalahan memukul, menyeksa, dan membiarkan
pembantu rumahnya yang juga warga indonesia kelaparan, dalam kes yang mencetuskan kemarahan penduduk
republik negara tersebut, tahun lalu. keputusan itu dibacakan di dalam kamar mahkamah, disambut sorakan
penyokong erwiana sulistyaningsih yang merupakan bekas pembantu rumah, law wan - tung. wan-tung, 44,
ibu kepada dua orang anak itu, ditangkap pada januari tahun lalu dan hukuman terhadapnya akan diputuskan
pada 27 februari ini.
Words describing category (legible entries): benar, hakim, mahkamah, pembantu, penderaan, tahun

Category: Buruh Paksa
Representative sentence: seramai 17 warga asing termasuk enam kanak - kanak berjaya diselamatkan oleh polis
semalam, selepas dikesan menjadi buruh paksa satu sindiket untuk mengemis di beberapa lokasi pasar malam di
puchong. ketua penolong pengarah bahagian kongsi gelap, judi dan maksiat (d7) bukit aman senior asisten
komisioner rohaimi md isa berkata, semua pengemis berusia antara dua tahun hingga 50-an yang diselamatkan
itu terdiri daripada dua lelaki, sembilan wanita dan enam kanak - kanak.
Words describing category (legible entries): asing, buruh, diselamatkan, dua, enam, lelaki, lokasi, malam,
mengemis, mengikut, paksa, pasar, siasatan, sindiket, termasuk, untuk, wanita

Category: Bahan-bahan Terkawal
Representative sentence: polis menahan lima individu, termasuk tiga warga asing dan merampas pelbagai jenis
dadah dianggarkan bernilai rm 6.7 juta sepanjang awal bulan ini sehingga kelmarin. pengarah jabatan siasatan
jenayah narkotik (jsjn) bukit aman, datuk seri noor rashid ibrahim, berkata polis turut berjaya membongkar
satu makmal dadah memproses dan membungkus pil ekstasi yang beroperasi di sebuah kondominium di jalan
kuchai maju pada jumaat lalu.
Words describing category (legible entries): aman, bernilai, bukit, dadah, dianggarkan, ekstasi, juta,
kondominium, pil, satu

Category: Pengganasan
Representative sentence: pihak berkuasa turki telah membunuh hampir 900 orang yang didakwa anggota kumpulan
militan negara islam (is) sejak januari lalu, kata agensi berita kerajaan, anatolia yang memetik sumber
ketenteraan negara itu. menurut anatolia, daripada jumlah itu, seramai 492 ' pengganas ' telah dibunuh
menerusi serangan udara manakala 370 lagi terbunuh dalam beberapa serangan meriam yang memusnahkan depot
senjata mereka. bagaimanapun, angka kematian itu tidak dapat disahkan secara bebas setakat ini.
Words describing category (legible entries): anatolia, daerah, ghor, hari, kira, kumpulan, militan, negara,
serangan, taliban, taywara, terbunuh


5. CONCLUSION AND FUTURE WORK 

Based on the validation of the classification from the Machine Learning tool with different feature
selections, the results of recall = 0.838, precision = 0.882, F-measure = 0.839 and ROC Area = 0.980
indicate that the classification is accurate. It can also be concluded that the word list used to categorize
the text from the articles is suitable for the task, since the average percentage of correctly classified
instances was 80.94%. Therefore, the 521 words in the crime word list can be used in future work to assist
in the tagging of crime in the Malay language.

Following the satisfactory results obtained in this study, it is suggested that in future research,
a stemmer/lemmatizer could be applied to the dataset to acquire a cleaner dataset. Stemming is the process
of reducing derived words to a common root, so that a general term can be generated. In this study, the
attributes contained multiple forms of the same word with different affixes such as ‘mem-’, ‘per-’, ‘-an’,
etc. Due to these affixes, the filtered dataset still carried attributes that represent the same words in
different forms. Therefore, the application of a lemmatizer would be able to produce a cleaner set of words.
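
The idea of affix stripping can be illustrated with a deliberately naive sketch. It only covers the three
affixes named above, the function name and the minimum root length are hypothetical, and it is not a
complete Malay stemmer: it does not handle the sound changes that accompany some prefixes (for example,
‘memukul’ is derived from ‘pukul’) and it would wrongly shorten some unrelated words that end in ‘an’.

```python
# Affixes taken from the examples in the text; a real stemmer needs many more rules.
PREFIXES = ("mem", "per")
SUFFIXES = ("an",)


def naive_stem(word, min_root=4):
    """Strip at most one listed prefix and one listed suffix, keeping a root of min_root letters."""
    stem = word.lower()
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_root:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_root:
            stem = stem[:-len(suffix)]
            break
    return stem


for word in ("membunuh", "permintaan", "dadah"):
    print(word, "->", naive_stem(word))
```

With such a step, different surface forms of one word would collapse into a single attribute before
feature selection, which is the effect the proposed future work aims for.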

Another improvement for future study is to deal with texts and words that belong to more than one
category. Assigning a text to more than one category is known as multi-label classification. Therefore, in
future work, multi-label classification should be taken into consideration for instances where words may
exist in more than one category.